from IPython.display import HTML

HTML('<p align="center"><div style="padding:75% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/739505155?autoplay=1&loop=1&h=b7141f3616&badge=0&autopause=0&player_id=0&app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen style="position:absolute;top:0;left:0;width:100%;height:100%;" title="Berkeley AIML CAPSTONE1"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script></p>')

Overview: In this capstone, my goal is to do the heavy lifting of the project. I will explore my data source, test a few of the techniques, and come up with a solution or answer to my research problem.
Research Problem
Data Source
Modeling Techniques To Be Used:
Expected Results
Why This Question Is Important
From our data source, we are presented with various data from over 10 years of clinical care at 130 US hospitals and integrated delivery networks. It includes over 40 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:
The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.
Here we will use pandas to read in the dataset diabetic_data.csv and assign a meaningful variable name.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn import preprocessing
from tqdm import tqdm
from time import time
from sklearn.datasets import load_digits
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import BayesianRidge, LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from statsmodels.tools.eval_measures import rmse
from sklearn.inspection import permutation_importance
from sklearn import svm
from pylab import rcParams
import dice_ml
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('data/diabetic_data.csv', sep = ',')
df.head()
| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |
5 rows × 50 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 encounter_id 101766 non-null int64 1 patient_nbr 101766 non-null int64 2 race 101766 non-null object 3 gender 101766 non-null object 4 age 101766 non-null object 5 weight 101766 non-null object 6 admission_type_id 101766 non-null int64 7 discharge_disposition_id 101766 non-null int64 8 admission_source_id 101766 non-null int64 9 time_in_hospital 101766 non-null int64 10 payer_code 101766 non-null object 11 medical_specialty 101766 non-null object 12 num_lab_procedures 101766 non-null int64 13 num_procedures 101766 non-null int64 14 num_medications 101766 non-null int64 15 number_outpatient 101766 non-null int64 16 number_emergency 101766 non-null int64 17 number_inpatient 101766 non-null int64 18 diag_1 101766 non-null object 19 diag_2 101766 non-null object 20 diag_3 101766 non-null object 21 number_diagnoses 101766 non-null int64 22 max_glu_serum 101766 non-null object 23 A1Cresult 101766 non-null object 24 metformin 101766 non-null object 25 repaglinide 101766 non-null object 26 nateglinide 101766 non-null object 27 chlorpropamide 101766 non-null object 28 glimepiride 101766 non-null object 29 acetohexamide 101766 non-null object 30 glipizide 101766 non-null object 31 glyburide 101766 non-null object 32 tolbutamide 101766 non-null object 33 pioglitazone 101766 non-null object 34 rosiglitazone 101766 non-null object 35 acarbose 101766 non-null object 36 miglitol 101766 non-null object 37 troglitazone 101766 non-null object 38 tolazamide 101766 non-null object 39 examide 101766 non-null object 40 citoglipton 101766 non-null object 41 insulin 101766 non-null object 42 glyburide-metformin 101766 non-null object 43 glipizide-metformin 101766 non-null object 44 glimepiride-pioglitazone 101766 non-null object 45 metformin-rosiglitazone 101766 non-null object 46 metformin-pioglitazone 101766 
non-null object 47 change 101766 non-null object 48 diabetesMed 101766 non-null object 49 readmitted 101766 non-null object dtypes: int64(13), object(37) memory usage: 38.8+ MB
df.describe()
| | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.017660e+05 | 1.017660e+05 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 | 101766.000000 |
| mean | 1.652016e+08 | 5.433040e+07 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.339730 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 1.026403e+08 | 3.869636e+07 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.933600 |
| min | 1.252200e+04 | 1.350000e+02 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 8.496119e+07 | 2.341322e+07 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 31.000000 | 0.000000 | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
| 50% | 1.523890e+08 | 4.550514e+07 | 1.000000 | 1.000000 | 7.000000 | 4.000000 | 44.000000 | 1.000000 | 15.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 |
| 75% | 2.302709e+08 | 8.754595e+07 | 3.000000 | 4.000000 | 7.000000 | 6.000000 | 57.000000 | 2.000000 | 20.000000 | 0.000000 | 0.000000 | 1.000000 | 9.000000 |
| max | 4.438672e+08 | 1.895026e+08 | 8.000000 | 28.000000 | 25.000000 | 14.000000 | 132.000000 | 6.000000 | 81.000000 | 42.000000 | 76.000000 | 21.000000 | 16.000000 |
df.isnull().sum()
encounter_id 0 patient_nbr 0 race 0 gender 0 age 0 weight 0 admission_type_id 0 discharge_disposition_id 0 admission_source_id 0 time_in_hospital 0 payer_code 0 medical_specialty 0 num_lab_procedures 0 num_procedures 0 num_medications 0 number_outpatient 0 number_emergency 0 number_inpatient 0 diag_1 0 diag_2 0 diag_3 0 number_diagnoses 0 max_glu_serum 0 A1Cresult 0 metformin 0 repaglinide 0 nateglinide 0 chlorpropamide 0 glimepiride 0 acetohexamide 0 glipizide 0 glyburide 0 tolbutamide 0 pioglitazone 0 rosiglitazone 0 acarbose 0 miglitol 0 troglitazone 0 tolazamide 0 examide 0 citoglipton 0 insulin 0 glyburide-metformin 0 glipizide-metformin 0 glimepiride-pioglitazone 0 metformin-rosiglitazone 0 metformin-pioglitazone 0 change 0 diabetesMed 0 readmitted 0 dtype: int64

Here we will examine the data and determine whether any of the features are missing values or need to be coerced to a different data type. Some features can be dropped because they are almost entirely empty. Others may need to be cleaned up further.
Input Variables:
Output variable (desired target):
#remove obvious irrelevant columns
df.drop(columns=['encounter_id','patient_nbr','weight', 'payer_code'], inplace=True)
#find all columns that contain values of '?'. This can throw off our data. We will need to correct this.
sub_df = df.loc[: , (df == '?').any()]
print(sub_df.columns)
Index(['race', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3'], dtype='object')
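The check above lists which columns contain the '?' placeholder, but not how much of each column is affected. A minimal sketch of how that share could be measured is shown here on a small made-up frame (the `toy` data and the 0.5 cutoff are assumptions for illustration, not values taken from this project):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 2],
})

# Treat '?' as missing, then compute the share of missing values per column
missing_share = toy.replace("?", np.nan).isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen cutoff (0.5 here is an arbitrary assumption) are drop candidates
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(drop_candidates)
```

Running the same two steps on the full dataset before dropping columns would show which features are mostly placeholders and which only have a few unknown values.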
#replace some values where it is marked as ? to unknown
df['race'] = df['race'].replace({'?':'Unknown'})
df['gender'] = df['gender'].replace({'?':'Unknown'})
df['medical_specialty'] = df['medical_specialty'].replace({'?':'Unknown'})
df['diag_1'] = df['diag_1'].replace({'?':'Unknown'})
df['diag_2'] = df['diag_2'].replace({'?':'Unknown'})
df['diag_3'] = df['diag_3'].replace({'?':'Unknown'})
#for the readmitted feature, replace 'NO' with 'No Readmission', and both '<30' and '>30' with 'Readmitted'
#we are collapsing the 3 options down to 2. Our goal for this project is simply to get the usual readmitted patients to a point
#where they are not being readmitted. Thus, the binary classification of No Readmission vs Readmitted
df['readmitted'] = df['readmitted'].replace({'NO':'No Readmission'})
df['readmitted'] = df['readmitted'].replace({'<30':'Readmitted'})
df['readmitted'] = df['readmitted'].replace({'>30':'Readmitted'})
df['age'] = df['age'].replace({'[0-10)':5,
'[10-20)':15,
'[20-30)':25,
'[30-40)':35,
'[40-50)':45,
'[50-60)':55 ,
'[60-70)':65 ,
'[70-80)':75,
'[80-90)':85 ,
'[90-100)':95 })
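The age mapping above replaces each bracket with its midpoint by hand. As a hedged alternative, the same dictionary could be derived from the bracket labels themselves. The helper name `bracket_midpoint` is hypothetical and only used for this sketch:

```python
import re

def bracket_midpoint(bracket: str) -> int:
    """Return the integer midpoint of an age bracket such as '[40-50)'."""
    low, high = (int(n) for n in re.findall(r"\d+", bracket))
    return (low + high) // 2

# Rebuild the ten bracket labels used in the dataset and map each to its midpoint
brackets = [f"[{lo}-{lo + 10})" for lo in range(0, 100, 10)]
age_map = {b: bracket_midpoint(b) for b in brackets}
print(age_map)
```

The resulting `age_map` could be passed to `replace` in the same way as the hand-written dictionary.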
#
#verify changes
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 46 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 race 101766 non-null object 1 gender 101766 non-null object 2 age 101766 non-null int64 3 admission_type_id 101766 non-null int64 4 discharge_disposition_id 101766 non-null int64 5 admission_source_id 101766 non-null int64 6 time_in_hospital 101766 non-null int64 7 medical_specialty 101766 non-null object 8 num_lab_procedures 101766 non-null int64 9 num_procedures 101766 non-null int64 10 num_medications 101766 non-null int64 11 number_outpatient 101766 non-null int64 12 number_emergency 101766 non-null int64 13 number_inpatient 101766 non-null int64 14 diag_1 101766 non-null object 15 diag_2 101766 non-null object 16 diag_3 101766 non-null object 17 number_diagnoses 101766 non-null int64 18 max_glu_serum 101766 non-null object 19 A1Cresult 101766 non-null object 20 metformin 101766 non-null object 21 repaglinide 101766 non-null object 22 nateglinide 101766 non-null object 23 chlorpropamide 101766 non-null object 24 glimepiride 101766 non-null object 25 acetohexamide 101766 non-null object 26 glipizide 101766 non-null object 27 glyburide 101766 non-null object 28 tolbutamide 101766 non-null object 29 pioglitazone 101766 non-null object 30 rosiglitazone 101766 non-null object 31 acarbose 101766 non-null object 32 miglitol 101766 non-null object 33 troglitazone 101766 non-null object 34 tolazamide 101766 non-null object 35 examide 101766 non-null object 36 citoglipton 101766 non-null object 37 insulin 101766 non-null object 38 glyburide-metformin 101766 non-null object 39 glipizide-metformin 101766 non-null object 40 glimepiride-pioglitazone 101766 non-null object 41 metformin-rosiglitazone 101766 non-null object 42 metformin-pioglitazone 101766 non-null object 43 change 101766 non-null object 44 diabetesMed 101766 non-null object 45 readmitted 101766 non-null object dtypes: int64(12), 
object(34) memory usage: 35.7+ MB
df.shape
(101766, 46)
#Quick glance at distribution of data
df.hist(figsize=(20,15), bins=25)
array([[<AxesSubplot:title={'center':'age'}>,
<AxesSubplot:title={'center':'admission_type_id'}>,
<AxesSubplot:title={'center':'discharge_disposition_id'}>],
[<AxesSubplot:title={'center':'admission_source_id'}>,
<AxesSubplot:title={'center':'time_in_hospital'}>,
<AxesSubplot:title={'center':'num_lab_procedures'}>],
[<AxesSubplot:title={'center':'num_procedures'}>,
<AxesSubplot:title={'center':'num_medications'}>,
<AxesSubplot:title={'center':'number_outpatient'}>],
[<AxesSubplot:title={'center':'number_emergency'}>,
<AxesSubplot:title={'center':'number_inpatient'}>,
<AxesSubplot:title={'center':'number_diagnoses'}>]], dtype=object)
df.head()
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | 5 | 6 | 25 | 1 | 1 | Pediatrics-Endocrinology | 41 | 0 | ... | No | No | No | No | No | No | No | No | No | No Readmission |
| 1 | Caucasian | Female | 15 | 1 | 1 | 7 | 3 | Unknown | 59 | 0 | ... | No | Up | No | No | No | No | No | Ch | Yes | Readmitted |
| 2 | AfricanAmerican | Female | 25 | 1 | 1 | 7 | 2 | Unknown | 11 | 5 | ... | No | No | No | No | No | No | No | No | Yes | No Readmission |
| 3 | Caucasian | Male | 35 | 1 | 1 | 7 | 2 | Unknown | 44 | 1 | ... | No | Up | No | No | No | No | No | Ch | Yes | No Readmission |
| 4 | Caucasian | Male | 45 | 1 | 1 | 7 | 1 | Unknown | 51 | 0 | ... | No | Steady | No | No | No | No | No | Ch | Yes | No Readmission |
5 rows × 46 columns
#pairplot for further visual observations
#sns.pairplot(df)
# Heat map to show the correlation matrix. Will aid in showing correlation between the numeric features
#fig, ax = plt.subplots(figsize=(10,8))
#sns.heatmap(df.corr(), annot=True)
Observations
#readmittance by race
fig, ax = plt.subplots(figsize=(15,5))
sns.countplot(data=df, x='race', hue='readmitted')
plt.title('Readmittance Occurrence Based on Race')
plt.legend(title='Readmittance', loc='upper right')
<matplotlib.legend.Legend at 0x20ba49aac70>
Observations
#Readmittance by gender
px.histogram(df, x='readmitted', color='gender', title='Readmittance Occurrence Based on Gender')
Observations
fig, ax = plt.subplots(figsize=(15,5))
sns.countplot(data=df, x='age', hue='readmitted')
plt.title('Readmittance Occurrence Based on Age')
plt.legend(title='Readmittance', loc='upper right')
<matplotlib.legend.Legend at 0x20baf865d00>
Observations
dic_status = {1: "Emergency", 2: "Urgent", 3: "Elective", 4: "Newborn", 5: "Not Available", 6: "Not Noted", 7: "Trauma Center", 8: "Not Mapped"}
df["admission_type_id_mapped"] = df["admission_type_id"].map(dic_status)
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='admission_type_id_mapped', hue='readmitted')
plt.title('Readmittance Occurrence Based on the Admission Type')
plt.legend(title='Readmittance', loc='upper right')
plt.xlabel("Reason For Being Admitted")
plt.tick_params(axis='x', rotation=30)
Observations
dic_status = {1: "Physician Referral",
2: "Clinic Referral",
3: "HMO Referral",
4: "Transfer From Hospital",
5: "Transfer From SNF",
6: "Transfer From HCF",
7: "Emergency Room",
8: "Court/Law Enforcement",
9: "Not Available",
10: "Transfer From CAH",
11: "Normal Delivery",
12: "Premature Delivery",
13: "Sick Baby",
14: "Extramural Birth",
15: "Not Available",
17: "Not Available",
18: "Transfer From HHA",
19: "Readmission to Same HHA",
20: "Not Mapped",
21: "Unknown/Invalid",
22: "Transfer From Hospital Inpatient",
23: "Born Inside This Hospital",
24: "Born Outside This Hospital",
25: "Transfer From ASC",
26: "Transfer From Hospice"}
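#the dictionary above describes admission sources, so map admission_source_id (not admission_type_id)
df["discharge_disposition_id_mapped"] = df["admission_source_id"].map(dic_status)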
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='discharge_disposition_id_mapped', hue='readmitted')
plt.title('Readmittance Occurrence Based on the Original Admission Source')
plt.legend(title='Readmittance by Source', loc='upper right')
plt.xlabel("Admission Source")
plt.tick_params(axis='x', rotation=90)
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='time_in_hospital', hue='readmitted')
plt.title('Time Spent In Hospital (in days)')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Time Spent In Hospital (days)")
plt.tick_params(axis='x', rotation=90)
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.barplot(data=df, x='num_lab_procedures', y='readmitted')
plt.title('Labs Taken Once Admitted')
#plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Labs Taken")
Text(0.5, 0, 'Labs Taken')
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='num_procedures', hue='readmitted')
plt.title('Number of Previous Procedures Performed')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Previous Procedures")
Text(0.5, 0, 'Previous Procedures')
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='num_medications', hue='readmitted')
plt.title('Number of Medications the Patient Is On')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Medication Count")
plt.tick_params(axis='x', rotation=-90)
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='number_outpatient', hue='readmitted')
plt.title('Number of Previous Outpatient Visits, Previous Yr')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Previous Outpatient Visits, Previous Yr")
plt.tick_params(axis='x', rotation=90)
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='number_emergency', hue='readmitted')
plt.title('Number of Previous Emergency Room Visits, Previous Yr')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Previous Emergency Room Visits, Previous Yr")
plt.tick_params(axis='x', rotation=90)
Observations
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(data=df, x='number_inpatient', hue='readmitted')
plt.title('Number of Previous Inpatient Admissions, Previous Yr')
plt.legend(title='Times Readmitted', loc='upper right')
plt.xlabel("Previous Inpatient Admissions, Previous Yr")
plt.tick_params(axis='x', rotation=90)
Observations
px.histogram(df, x='max_glu_serum', color='readmitted', title='Maximum Glucose Serum', text_auto=True)
Observations
px.histogram(df, x='A1Cresult', color='readmitted', title='A1C Result', text_auto=True)
Observations
px.histogram(df, x='metformin', color='readmitted', title='Drug: Metformin for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='repaglinide', color='readmitted', title='Drug: Repaglinide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='nateglinide', color='readmitted', title='Drug: Nateglinide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='chlorpropamide', color='readmitted', title='Drug: Chlorpropamide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='glimepiride', color='readmitted', title='Drug: glimepiride for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='acetohexamide', color='readmitted', title='Drug: acetohexamide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='glipizide', color='readmitted', title='Drug: glipizide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='glyburide', color='readmitted', title='Drug: glyburide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='tolbutamide', color='readmitted', title='Drug: tolbutamide for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='pioglitazone', color='readmitted', title='Drug: pioglitazone for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='rosiglitazone', color='readmitted', title='Drug: rosiglitazone for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='acarbose', color='readmitted', title='Drug: acarbose for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='miglitol', color='readmitted', title='Drug: miglitol for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='troglitazone', color='readmitted', title='Drug: troglitazone for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='tolazamide', color='readmitted', title='Drug: tolazamide for High Blood Sugar', text_auto=True)
px.histogram(df, x='examide', color='readmitted', title='Drug: Examide for Fluid Overload', text_auto=True)
Observations
px.histogram(df, x='citoglipton', color='readmitted', title='Drug: Citoglipton for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='insulin', color='readmitted', title='Drug: Insulin for Controlling Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='glyburide-metformin', color='readmitted', title='Drug: glyburide-metformin for', text_auto=True)
Observations
px.histogram(df, x='glipizide-metformin', color='readmitted', title='Drug: glipizide-metformin for High Blood Sugar', text_auto=True)
Observations
px.histogram(df, x='glimepiride-pioglitazone', color='readmitted', title='Drug: glimepiride-pioglitazone for', text_auto=True)
Observations
px.histogram(df, x='metformin-rosiglitazone', color='readmitted', title='Drug: metformin-rosiglitazone for', text_auto=True)
Observations
px.histogram(df, x='metformin-pioglitazone', color='readmitted', title='Drug: metformin-pioglitazone for ', text_auto=True)
Observations
px.histogram(df, x='change', color='readmitted', title='Change Of Medications', text_auto=True)
Observations
px.histogram(df, x='diabetesMed', color='readmitted', title='Diabetes Medication Prescribed', text_auto=True)
Observations
#Done with visuals, so we can drop the mapped columns now.
df.drop(columns=['admission_type_id_mapped', 'discharge_disposition_id_mapped'], inplace=True)
Identification of Business Goals
Now that basic business goals and objectives have been stated, we will build a basic model to get started. Before we can do this, we must work to encode the data where the values are not already numerical. I will prepare the features and target column for modeling with appropriate encoding and transformations.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 101766 entries, 0 to 101765 Data columns (total 46 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 race 101766 non-null object 1 gender 101766 non-null object 2 age 101766 non-null int64 3 admission_type_id 101766 non-null int64 4 discharge_disposition_id 101766 non-null int64 5 admission_source_id 101766 non-null int64 6 time_in_hospital 101766 non-null int64 7 medical_specialty 101766 non-null object 8 num_lab_procedures 101766 non-null int64 9 num_procedures 101766 non-null int64 10 num_medications 101766 non-null int64 11 number_outpatient 101766 non-null int64 12 number_emergency 101766 non-null int64 13 number_inpatient 101766 non-null int64 14 diag_1 101766 non-null object 15 diag_2 101766 non-null object 16 diag_3 101766 non-null object 17 number_diagnoses 101766 non-null int64 18 max_glu_serum 101766 non-null object 19 A1Cresult 101766 non-null object 20 metformin 101766 non-null object 21 repaglinide 101766 non-null object 22 nateglinide 101766 non-null object 23 chlorpropamide 101766 non-null object 24 glimepiride 101766 non-null object 25 acetohexamide 101766 non-null object 26 glipizide 101766 non-null object 27 glyburide 101766 non-null object 28 tolbutamide 101766 non-null object 29 pioglitazone 101766 non-null object 30 rosiglitazone 101766 non-null object 31 acarbose 101766 non-null object 32 miglitol 101766 non-null object 33 troglitazone 101766 non-null object 34 tolazamide 101766 non-null object 35 examide 101766 non-null object 36 citoglipton 101766 non-null object 37 insulin 101766 non-null object 38 glyburide-metformin 101766 non-null object 39 glipizide-metformin 101766 non-null object 40 glimepiride-pioglitazone 101766 non-null object 41 metformin-rosiglitazone 101766 non-null object 42 metformin-pioglitazone 101766 non-null object 43 change 101766 non-null object 44 diabetesMed 101766 non-null object 45 readmitted 101766 non-null object dtypes: int64(12), 
object(34) memory usage: 35.7+ MB
#Just to get a general idea of where we stand on count for each of our values for target feature
df['readmitted'].value_counts()
No Readmission    54864
Readmitted        46902
Name: readmitted, dtype: int64
#Update gender to be numerical value
df['gender'] = df['gender'].replace({'Male':1, 'Female':2, 'Unknown/Invalid':3})
#Identify and define the numerical and categorical columns
numerical_cols = [
'age',
'gender',
'admission_type_id',
'discharge_disposition_id',
'admission_source_id',
'time_in_hospital',
'num_lab_procedures',
'num_procedures',
'num_medications',
'number_outpatient',
'number_emergency',
'number_inpatient',
'number_diagnoses'
]
categorical_cols = ['race',
'medical_specialty',
'diag_1',
'diag_2',
'diag_3',
'max_glu_serum',
'A1Cresult',
'metformin',
'repaglinide',
'nateglinide',
'chlorpropamide',
'glimepiride',
'acetohexamide',
'glipizide',
'glyburide',
'tolbutamide',
'pioglitazone',
'rosiglitazone',
'acarbose',
'miglitol',
'troglitazone',
'tolazamide',
'examide',
'citoglipton',
'insulin',
'glyburide-metformin',
'glipizide-metformin',
'glimepiride-pioglitazone',
'metformin-rosiglitazone',
'metformin-pioglitazone',
'change',
'diabetesMed',
'readmitted']
#Here we will use our Label Encoder to help digitize those columns which are not int/float
encoder_df = df.copy()
encoder = preprocessing.LabelEncoder()
def target_encoder(data):
    #label-encode the non-null values of a single column and return the encoded copy
    data = data.copy()
    mask = data.notnull()
    data.loc[mask] = encoder.fit_transform(data.loc[mask])
    return data
for col in tqdm(categorical_cols):
    encoder_df[col] = target_encoder(encoder_df[col])
100%|██████████| 33/33 [00:02<00:00, 11.25it/s]
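Label encoding assigns each category an integer, which gives nominal features such as race an arbitrary numeric order that a linear model will treat as meaningful. One possible alternative, sketched here on a made-up frame (the `toy` data and the step name `onehot` are assumptions), is to one-hot encode the nominal columns and pass the numeric ones through unchanged:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two nominal columns and one numeric column (illustrative values)
toy = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican", "Unknown", "Caucasian"],
    "insulin": ["No", "Up", "Steady", "No"],
    "time_in_hospital": [1, 3, 2, 2],
})

# One-hot encode the nominal columns; leave every other column as it is
encoder_step = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["race", "insulin"]),
    ],
    remainder="passthrough",
)
encoded = encoder_step.fit_transform(toy)
print(encoded.shape)
```

For high-cardinality columns such as the diagnosis codes, this approach would create a very large number of columns, so grouping the codes first may be worth considering.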
With our data prepared, split it into a train and test set
X = encoder_df.drop('readmitted', axis=1)
y = encoder_df['readmitted'].astype('int')
#print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
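The split above is random. If we want the train and test sets to keep the same class proportions as the full data, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a binary target with roughly 46% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.46).astype(int)

# stratify=y_demo keeps the class ratio (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```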
Before we build our first model, we want to establish a baseline.
models = ['Baseline','Logistic Regression', 'Decision Tree']
train_time = []
train_accuracy = []
test_accuracy = []
accuracy_score = []
AUC_score = []
dummy = DummyClassifier(strategy='uniform', random_state=42)
start_time = time()
dummy.fit(X_train, y_train)
train_time.append(time() - start_time)
train_accuracy.append(dummy.score(X_train, y_train))
test_accuracy.append(dummy.score(X_test, y_test))
accuracy_score.append('N/A')
AUC_score.append('N/A')
print("#################BASELINE ANALYSIS####################\n")
print(f'Training time :{train_time}')
print(f'Training accuracy :{train_accuracy}')
print(f'Test accuracy :{test_accuracy}')
print(f'Accuracy score : Not Yet Available')
print(f'AUC score : Not Yet Available')
print("\n######################################################")
#################BASELINE ANALYSIS####################

Training time :[0.0029990673065185547]
Training accuracy :[0.4976837554045707]
Test accuracy :[0.49924664264657714]
Accuracy score : Not Yet Available
AUC score : Not Yet Available

######################################################
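The baseline above guesses uniformly at random, so it lands near 50%. A stricter reference point is a classifier that always predicts the majority class. The sketch below rebuilds a target from the class counts reported by `value_counts()` earlier (54864 and 46902) and scores it on that full set rather than on the test split, so the number is only indicative:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild a target vector from the class counts reported earlier in the notebook
y_counts = np.array([0] * 54864 + [1] * 46902)
# DummyClassifier ignores the features, so a column of zeros is enough here
X_placeholder = np.zeros((len(y_counts), 1))

majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_placeholder, y_counts)
majority_accuracy = majority.score(X_placeholder, y_counts)
print(f"Majority-class accuracy: {majority_accuracy:.4f}")
```

Any model worth keeping should beat this figure (about 0.539) as well as the uniform baseline.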
maxit = 100000
lgr = LogisticRegression(solver='liblinear', random_state=42, max_iter=maxit)
start_time = time()
lgr.fit(X_train, y_train)
train_time.append(time() - start_time)
y_pred = lgr.predict(X_test)
train_accuracy.append(lgr.score(X_train, y_train))
test_accuracy.append(lgr.score(X_test, y_test))
accuracy_score.append(metrics.accuracy_score(y_test, y_pred))
fpr, tpr, _thresholds = metrics.roc_curve(y_test, y_pred)
AUC_score.append(metrics.auc(fpr, tpr))
print("#################LOGISTIC REGRESSION ANALYSIS####################\n")
print(f'Training time :{train_time[1]}')
print(f'Training accuracy :{train_accuracy[1]}')
print(f'Test accuracy :{test_accuracy[1]}')
print(f'Accuracy score : {accuracy_score[1]}')
print(f'AUC score : {AUC_score[1]}')
print("\n######################################################")
#################LOGISTIC REGRESSION ANALYSIS####################

Training time :3.8975634574890137
Training accuracy :0.6180582851367287
Test accuracy :0.6188994431706518
Accuracy score : 0.6188994431706518
AUC score : 0.6034773037012175

######################################################
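The AUC above is computed from the hard class predictions in `y_pred`, which reduces the ROC curve to a single threshold. AUC is usually computed from predicted probabilities instead. A minimal sketch of the difference on synthetic data (the dataset and variable names here are assumptions, not this project's data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem for illustration
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one operating point is evaluated
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of the positive class: all thresholds are evaluated
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

In this notebook, the equivalent change would be to pass `lgr.predict_proba(X_test)[:, 1]` to `metrics.roc_curve` in place of `y_pred`.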
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Logistic Regression")
Text(0.5, 1.0, 'Logistic Regression')
model_pipeline = []
model_pipeline.append(DecisionTreeClassifier(random_state=42))
cm_results = []
model_pipeline
[DecisionTreeClassifier(random_state=42)]
for model in model_pipeline:
    start_time = time()
    model.fit(X_train, y_train)
    train_time.append(time() - start_time)
    y_pred = model.predict(X_test)
    train_accuracy.append(model.score(X_train, y_train))
    test_accuracy.append(model.score(X_test, y_test))
    accuracy_score.append(metrics.accuracy_score(y_test, y_pred))
    fpr, tpr, _thresholds = metrics.roc_curve(y_test, y_pred)
    AUC_score.append(metrics.auc(fpr, tpr))
    cm_results.append(confusion_matrix(y_test, y_pred))
fig = plt.figure(figsize=(12, 10))
for i in range(len(cm_results)):
    cm = cm_results[i]
    model = models[i+2]
    sub = fig.add_subplot(2, 2, i+1).set_title(model)
    cm_plot = sns.heatmap(cm, annot=True, cmap='icefire', fmt='d')
    cm_plot.set_xlabel('Predictions')
    cm_plot.set_ylabel('Actuals')
results1_df = pd.DataFrame({'Model': models, 'Train Time': train_time, 'Train Score': train_accuracy , 'Test Score': test_accuracy , 'Accuracy Score': accuracy_score, 'AUC Score': AUC_score})
results1_df
| | Model | Train Time | Train Score | Test Score | Accuracy Score | AUC Score |
|---|---|---|---|---|---|---|
| 0 | Baseline | 0.002999 | 0.497684 | 0.499247 | N/A | N/A |
| 1 | Logistic Regression | 3.897563 | 0.618058 | 0.618899 | 0.618899 | 0.603477 |
| 2 | Decision Tree | 0.859580 | 1.000000 | 0.563904 | 0.563904 | 0.56149 |
models = ['Logistic Regression', 'Decision Tree']
best_params = []
fit_time = []
acc_score = []
recall_score = []
prec_score = []
f1_score = []
r2_score = []
roc_auc_score = []
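grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1-Lasso l2-Ridge
#note: the default 'lbfgs' solver does not support l1, so the l1 candidates fail to fit (warnings are suppressed above)
#and only l2 is effectively compared; pass solver='liblinear' or solver='saga' to include l1 in the search
lgr = LogisticRegression(max_iter=maxit)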
lgr_cv=GridSearchCV(lgr, grid, cv=10)
lgr_cv.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=LogisticRegression(max_iter=100000),
param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
'penalty': ['l1', 'l2']})
lgr_cv.best_params_
{'C': 0.01, 'penalty': 'l2'}
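The search above pairs an l1/l2 penalty grid with the default solver. As a hedged sketch of how both penalties could actually be compared, the example below runs the same grid with `solver="liblinear"`, which supports l1 and l2, on synthetic data (the data, the 5-fold setting, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data keeps the example fast; this is not the project's dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

param_grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
# liblinear handles both l1 and l2, so every candidate in the grid can be fitted
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000), param_grid, cv=5
)
search.fit(X_demo, y_demo)

# Count candidates whose cross-validated score is NaN (that is, whose fits failed)
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_, "failed candidates:", n_failed)
```

Checking `cv_results_` for NaN scores is a quick way to spot silently failed candidates when warnings are suppressed.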
#note: C=0.1 is hard-coded here, while the grid search above reported C=0.01 as the best value
lgr2=LogisticRegression(C=0.1, penalty="l2", random_state=42, max_iter=maxit)
start_time = time()
lgr2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = lgr2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
Text(0.5, 1.0, 'Performance Confusion Matrix')
best_params.append(lgr_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
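The ROC AUC above is computed from hard 0/1 predictions. ROC AUC is usually computed from continuous scores such as predicted probabilities, which lets the curve cover every threshold. The sketch below shows both variants side by side on synthetic stand-in data; `X_demo`, `clf`, `auc_from_labels`, and `auc_from_proba` are hypothetical names, and the metric is imported under an alias because this notebook already uses `roc_auc_score` as a list name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score as sk_roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard class labels: only a single threshold (0.5) is represented.
auc_from_labels = sk_roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of the positive class: all thresholds are used.
auc_from_proba = sk_roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```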
importance = lgr2.coef_[0]
for i, v in enumerate(importance):
    print(f'Feature: {df.columns[i]} \t Score:{v}')
plt.bar([x for x in range(len(importance))], importance)
plt.show()
Feature: race 	 Score:-0.033796467119041725
Feature: gender 	 Score:0.05220811772492702
Feature: age 	 Score:0.0029344954582298266
Feature: admission_type_id 	 Score:0.032138882815709854
Feature: discharge_disposition_id 	 Score:-0.012756074107160375
Feature: admission_source_id 	 Score:0.005697368221586632
Feature: time_in_hospital 	 Score:0.015309280096100903
Feature: medical_specialty 	 Score:0.0009692045401855414
Feature: num_lab_procedures 	 Score:0.0009561158077616839
Feature: num_procedures 	 Score:-0.04584900082060058
Feature: num_medications 	 Score:0.0009861661339249123
Feature: number_outpatient 	 Score:0.08235667848348943
Feature: number_emergency 	 Score:0.1888097939190253
Feature: number_inpatient 	 Score:0.3728662559579259
Feature: diag_1 	 Score:-0.00013084404297845518
Feature: diag_2 	 Score:-0.00016948616423528032
Feature: diag_3 	 Score:3.229928653600553e-05
Feature: number_diagnoses 	 Score:0.08312833535252362
Feature: max_glu_serum 	 Score:-0.0191027671164415
Feature: A1Cresult 	 Score:-0.02281927899061262
Feature: metformin 	 Score:-0.11773066366763427
Feature: repaglinide 	 Score:-0.016584307605706474
Feature: nateglinide 	 Score:-0.19847540350258402
Feature: chlorpropamide 	 Score:-0.2624410900601061
Feature: glimepiride 	 Score:-0.02035639009589943
Feature: acetohexamide 	 Score:0.004290645238704124
Feature: glipizide 	 Score:0.06904377139404178
Feature: glyburide 	 Score:0.007883757005958742
Feature: tolbutamide 	 Score:-0.023482659033740666
Feature: pioglitazone 	 Score:0.0760129295032895
Feature: rosiglitazone 	 Score:0.06253838495541769
Feature: acarbose 	 Score:-0.16133545826556323
Feature: miglitol 	 Score:-0.3045983331251128
Feature: troglitazone 	 Score:0.004864037064982501
Feature: tolazamide 	 Score:-0.00474502797746427
Feature: examide 	 Score:0.0
Feature: citoglipton 	 Score:0.0
Feature: insulin 	 Score:-0.0677520106092102
Feature: glyburide-metformin 	 Score:-0.14107438654274135
Feature: glipizide-metformin 	 Score:0.004826525999850238
Feature: glimepiride-pioglitazone 	 Score:0.003183159044823325
Feature: metformin-rosiglitazone 	 Score:-0.0030582987827478575
Feature: metformin-pioglitazone 	 Score:-0.0028793596105040455
Feature: change 	 Score:-0.046693629295805776
Feature: diabetesMed 	 Score:0.2760422737557418
encoder_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 46 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   race                      101766 non-null  object
 1   gender                    101766 non-null  int64
 2   age                       101766 non-null  int64
 3   admission_type_id         101766 non-null  int64
 4   discharge_disposition_id  101766 non-null  int64
 5   admission_source_id       101766 non-null  int64
 6   time_in_hospital          101766 non-null  int64
 7   medical_specialty         101766 non-null  object
 8   num_lab_procedures        101766 non-null  int64
 9   num_procedures            101766 non-null  int64
 10  num_medications           101766 non-null  int64
 11  number_outpatient         101766 non-null  int64
 12  number_emergency          101766 non-null  int64
 13  number_inpatient          101766 non-null  int64
 14  diag_1                    101766 non-null  object
 15  diag_2                    101766 non-null  object
 16  diag_3                    101766 non-null  object
 17  number_diagnoses          101766 non-null  int64
 18  max_glu_serum             101766 non-null  object
 19  A1Cresult                 101766 non-null  object
 20  metformin                 101766 non-null  object
 21  repaglinide               101766 non-null  object
 22  nateglinide               101766 non-null  object
 23  chlorpropamide            101766 non-null  object
 24  glimepiride               101766 non-null  object
 25  acetohexamide             101766 non-null  object
 26  glipizide                 101766 non-null  object
 27  glyburide                 101766 non-null  object
 28  tolbutamide               101766 non-null  object
 29  pioglitazone              101766 non-null  object
 30  rosiglitazone             101766 non-null  object
 31  acarbose                  101766 non-null  object
 32  miglitol                  101766 non-null  object
 33  troglitazone              101766 non-null  object
 34  tolazamide                101766 non-null  object
 35  examide                   101766 non-null  object
 36  citoglipton               101766 non-null  object
 37  insulin                   101766 non-null  object
 38  glyburide-metformin       101766 non-null  object
 39  glipizide-metformin       101766 non-null  object
 40  glimepiride-pioglitazone  101766 non-null  object
 41  metformin-rosiglitazone   101766 non-null  object
 42  metformin-pioglitazone    101766 non-null  object
 43  change                    101766 non-null  object
 44  diabetesMed               101766 non-null  object
 45  readmitted                101766 non-null  object
dtypes: int64(13), object(33)
memory usage: 35.7+ MB
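# Note: min_samples_split must be an integer >= 2 (or a float fraction), so the
# candidates with min_samples_split=1 fail to fit and are not scored.
params = {'max_depth': [1, 3, 5, 7, 11, 13, 15, 17, 19, 21, 23],
          'min_samples_split': [1, 50, 2],
          'criterion': ['gini', 'entropy'],
          'min_samples_leaf': [1]}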
dtree = DecisionTreeClassifier()
dtree_cv = GridSearchCV(dtree, params, cv=10)
dtree_cv.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
param_grid={'criterion': ['gini', 'entropy'],
'max_depth': [1, 3, 5, 7, 11, 13, 15, 17, 19, 21, 23],
'min_samples_leaf': [1],
'min_samples_split': [1, 50, 2]})
dtree_cv.best_params_
{'criterion': 'entropy',
'max_depth': 7,
'min_samples_leaf': 1,
'min_samples_split': 2}
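If the `min_samples_split` list was meant to describe a range of values rather than the three literal values 1, 50, and 2 (an assumption; the notebook does not say), it can be written with `range`. The sketch below uses synthetic stand-in data and a reduced grid; `X_demo`, `demo_params`, `demo_tree_search`, and `demo_tree` are hypothetical names, and `error_score="raise"` is added so that an invalid candidate stops the search instead of being scored as NaN.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_params = {
    "max_depth": [1, 3, 5, 7],
    # Start at 2: a node needs at least two samples to be split.
    "min_samples_split": list(range(2, 50, 8)),  # 2, 10, 18, 26, 34, 42
    "criterion": ["gini", "entropy"],
}

# error_score="raise" surfaces invalid parameter combinations immediately.
demo_tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), demo_params, cv=3, error_score="raise"
)
demo_tree_search.fit(X_demo, y_demo)

# Unpack the selected parameters so the refit tree matches the search result.
demo_tree = DecisionTreeClassifier(random_state=42, **demo_tree_search.best_params_)
demo_tree.fit(X_demo, y_demo)
print(demo_tree_search.best_params_)
```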
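# Note: the grid search above selected criterion='entropy' and min_samples_split=2,
# while this refit uses criterion='gini' and min_samples_split=50.
dtree2 = DecisionTreeClassifier(criterion='gini', max_depth=7, min_samples_leaf=1, min_samples_split=50, random_state=42)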
start_time = time()
dtree2.fit(X_train,y_train)
train_time = time() - start_time
y_pred = dtree2.predict(X_test)
cm_plot = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, cmap='icefire', fmt='d')
cm_plot.set_xlabel('Predictions')
cm_plot.set_ylabel('Actuals')
plt.title("Performance Confusion Matrix")
Text(0.5, 1.0, 'Performance Confusion Matrix')
best_params.append(dtree_cv.best_params_)
fit_time.append(train_time)
acc_score.append(metrics.accuracy_score(y_test, y_pred))
recall_score.append(metrics.recall_score(y_test, y_pred))
prec_score.append(metrics.precision_score(y_test, y_pred))
f1_score.append(metrics.f1_score(y_test, y_pred))
r2_score.append(metrics.r2_score(y_test, y_pred))
roc_auc_score.append(metrics.roc_auc_score(y_test, y_pred))
dtree_coeffs = pd.Series(dtree2.feature_importances_, index=X_train.columns)
dtree_coeffs.plot(kind='bar', figsize=(15,10))
plt.title('Decision Tree Feature Importances')
plt.xlabel('Feature')
plt.tick_params(axis='x', rotation=-80)
#top 15 features, according to decision tree
dtree_coeffs.sort_values(ascending=False).head(15)
number_inpatient            0.508212
discharge_disposition_id    0.195538
number_diagnoses            0.076576
number_outpatient           0.035812
number_emergency            0.034875
admission_source_id         0.034073
diabetesMed                 0.027287
age                         0.026784
num_medications             0.009788
diag_3                      0.008649
num_lab_procedures          0.008617
time_in_hospital            0.008107
diag_1                      0.006987
admission_type_id           0.004713
num_procedures              0.004016
dtype: float64
results2_df = pd.DataFrame({'Model': models, 'Best_Params': best_params, 'Train_Time': fit_time, 'Accuracy_Score': acc_score, 'Recall_Score': recall_score, 'Precision_Score': prec_score, 'F1_Score': f1_score, 'R2_Score': r2_score, 'ROC_AUC_Score': roc_auc_score})
results2_df
| | Model | Best_Params | Train_Time | Accuracy_Score | Recall_Score | Precision_Score | F1_Score | R2_Score | ROC_AUC_Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | {'C': 0.01, 'penalty': 'l2'} | 27.010921 | 0.617982 | 0.407136 | 0.632928 | 0.495523 | -0.537509 | 0.602663 |
| 1 | Decision Tree | {'criterion': 'entropy', 'max_depth': 7, 'min_... | 0.403614 | 0.633344 | 0.497548 | 0.629213 | 0.555688 | -0.475682 | 0.623478 |
pd.set_option('display.max_columns',None)
X = encoder_df.drop('readmitted', axis=1)
y = encoder_df['readmitted'].astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123, stratify = y)
encoder_df.columns
Index(['race', 'gender', 'age', 'admission_type_id',
'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
'medical_specialty', 'num_lab_procedures', 'num_procedures',
'num_medications', 'number_outpatient', 'number_emergency',
'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses',
'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide',
'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide',
'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton',
'insulin', 'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
dtype='object')
# Array of features that a physician can actually influence in an attempt to keep a diabetic
# patient healthier. Some features, such as race, age, and gender, cannot be modified to change the health outcome,
# so we omit them from our continuous feature array. We only want items that can be changed through medication, nutrition,
# exercise, and/or some other means deemed acceptable by physicians/researchers.
cf_array = ['num_medications', 'number_outpatient', 'number_emergency',
'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses',
'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide',
'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide',
'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose',
'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton',
'insulin', 'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-rosiglitazone',
'metformin-pioglitazone', 'change', 'diabetesMed']
# DiCE complained about not readily recognizing these values; encoder_df.info()
# showed them as objects, so we manually convert them to numeric types
# (float for the diagnosis codes, int for the rest).
encoder_df['diag_1'] = encoder_df['diag_1'].astype('float')
encoder_df['diag_2'] = encoder_df['diag_2'].astype('float')
encoder_df['diag_3'] = encoder_df['diag_3'].astype('float')
encoder_df['max_glu_serum'] = encoder_df['max_glu_serum'].astype('int')
encoder_df['A1Cresult'] = encoder_df['A1Cresult'].astype('int')
encoder_df['metformin'] = encoder_df['metformin'].astype('int')
encoder_df['repaglinide'] = encoder_df['repaglinide'].astype('int')
encoder_df['nateglinide'] = encoder_df['nateglinide'].astype('int')
encoder_df['chlorpropamide'] = encoder_df['chlorpropamide'].astype('int')
encoder_df['glimepiride'] = encoder_df['glimepiride'].astype('int')
encoder_df['acetohexamide'] = encoder_df['acetohexamide'].astype('int')
encoder_df['glipizide'] = encoder_df['glipizide'].astype('int')
encoder_df['glyburide'] = encoder_df['glyburide'].astype('int')
encoder_df['tolbutamide'] = encoder_df['tolbutamide'].astype('int')
encoder_df['pioglitazone'] = encoder_df['pioglitazone'].astype('int')
encoder_df['rosiglitazone'] = encoder_df['rosiglitazone'].astype('int')
encoder_df['acarbose'] = encoder_df['acarbose'].astype('int')
encoder_df['miglitol'] = encoder_df['miglitol'].astype('int')
encoder_df['troglitazone'] = encoder_df['troglitazone'].astype('int')
encoder_df['tolazamide'] = encoder_df['tolazamide'].astype('int')
encoder_df['examide'] = encoder_df['examide'].astype('int')
encoder_df['citoglipton'] = encoder_df['citoglipton'].astype('int')
encoder_df['insulin'] = encoder_df['insulin'].astype('int')
encoder_df['glyburide-metformin'] = encoder_df['glyburide-metformin'].astype('int')
encoder_df['glipizide-metformin'] = encoder_df['glipizide-metformin'].astype('int')
encoder_df['glimepiride-pioglitazone'] = encoder_df['glimepiride-pioglitazone'].astype('int')
encoder_df['metformin-rosiglitazone'] = encoder_df['metformin-rosiglitazone'].astype('int')
encoder_df['metformin-pioglitazone'] = encoder_df['metformin-pioglitazone'].astype('int')
encoder_df['change'] = encoder_df['change'].astype('int')
encoder_df['diabetesMed'] = encoder_df['diabetesMed'].astype('int')
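Casting one column per line makes it easy to copy the wrong source column by accident. As a minimal sketch, the same conversion can be driven by a single column-to-dtype mapping passed to `DataFrame.astype`. The toy frame, its values, and the names `toy_df`, `float_cols`, `int_cols`, and `dtype_map` are hypothetical stand-ins for `encoder_df` and its columns.

```python
import pandas as pd

# Toy stand-in for encoder_df: numeric values stored as strings (object dtype).
toy_df = pd.DataFrame(
    {
        "diag_1": ["329", "250"],
        "diag_2": ["333", "401"],
        "metformin": ["2", "1"],
        "insulin": ["1", "0"],
    },
    dtype="object",
)

float_cols = ["diag_1", "diag_2"]   # diagnosis codes -> float
int_cols = ["metformin", "insulin"]  # encoded categories -> int

# Build one mapping so each column is cast from itself, never from a neighbour.
dtype_map = {col: "float" for col in float_cols}
dtype_map.update({col: "int" for col in int_cols})

toy_df = toy_df.astype(dtype_map)
print(toy_df.dtypes)
```

In the notebook itself, `float_cols` would hold the three `diag_*` columns and `int_cols` the remaining entries of `cf_array` that are still stored as objects.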
# step 1 - dice_ml.Data
d = dice_ml.Data(dataframe = encoder_df,
continuous_features = cf_array, outcome_name = 'readmitted')
d
<dice_ml.data_interfaces.public_data_interface.PublicData at 0x20bb0175160>
Diverse Counterfactual Explanations (DiCE) for ML: https://github.com/interpretml/DiCE
dtclf = DecisionTreeClassifier().fit(X_train, y_train)
# use the sklearn backend
m = dice_ml.Model(model = dtclf, backend = "sklearn")
exp = dice_ml.Dice(d, m, method = 'random')
e1 = exp.generate_counterfactuals(X_test[0:1],
total_CFs = 2,
desired_class = "opposite")
e1.visualize_as_dataframe()
100%|██████████| 1/1 [00:00<00:00, 2.83it/s]
Query instance (original outcome : 1)
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 0 | 15 | 0 | 0 | 3 | 329.0 | 333.0 | 246.0 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Diverse Counterfactual set (new outcome: 0.0)
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 0 | 15 | 0 | 0 | 3 | 329 | 672.6 | 246 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 6 | 15 | 0 | 0 | 3 | 329 | 333 | 246 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
e2 = exp.generate_counterfactuals(X_test[0:1],
total_CFs = 2,
desired_class = "opposite",
features_to_vary = cf_array)
e2.visualize_as_dataframe()
100%|██████████| 1/1 [00:00<00:00, 2.86it/s]
Query instance (original outcome : 1)
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 0 | 15 | 0 | 0 | 3 | 329.0 | 333.0 | 246.0 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Diverse Counterfactual set (new outcome: 0.0)
| | race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 0 | 15 | 0 | 0 | 3 | 329 | 333 | 246 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 0.0 | 0.0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 2 | 65 | 1 | 1 | 7 | 4 | 71 | 37 | 0 | 15 | 0 | 0 | 3 | 329 | 591.3 | 246 | 9 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
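With 46 columns, the changed cells in a counterfactual table are hard to spot by eye. The sketch below lists, for each counterfactual, only the features whose values differ from the query instance. It uses plain pandas on a four-column excerpt of the `e2` output above (`diag_2`, `glipizide`, `glyburide`, `number_inpatient`); the helper `changed_features` and the names `query` and `cfs` are hypothetical and not part of the DiCE API.

```python
import pandas as pd

# Four-column excerpt of the e2 query instance and its two counterfactuals.
query = pd.Series(
    {"diag_2": 333.0, "glipizide": 1, "glyburide": 1, "number_inpatient": 3}
)
cfs = pd.DataFrame(
    [
        {"diag_2": 333.0, "glipizide": 0, "glyburide": 0, "number_inpatient": 3},
        {"diag_2": 591.3, "glipizide": 1, "glyburide": 1, "number_inpatient": 3},
    ]
)


def changed_features(query_row, cf_rows):
    """Map each counterfactual index to {feature: (original, counterfactual)}."""
    changes = {}
    for idx, cf in cf_rows.iterrows():
        # Keep only the features whose value differs from the query instance.
        diff = cf[cf != query_row]
        changes[idx] = {name: (query_row[name], value) for name, value in diff.items()}
    return changes


changes = changed_features(query, cfs)
for idx, diff in changes.items():
    print(idx, diff)
```

On this excerpt, the first counterfactual differs from the query only in `glipizide` and `glyburide`, and the second only in `diag_2`.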